Using Data Mining Techniques in the Real Estate Industry

Author: Madalina-Alina Racovita, 1st-year master's student in Computational Optimization at the Faculty of Computer Science, UAIC, Iasi, Romania


Import dependencies & environment configuration

In [1]:
!pip install MiniSom
!pip install yellowbrick
!pip install PyQt5
Requirement already satisfied: MiniSom in c:\tools\anaconda3\lib\site-packages (2.2.5)
Requirement already satisfied: yellowbrick in c:\tools\anaconda3\lib\site-packages (1.1)
Requirement already satisfied: scikit-learn>=0.20 in c:\tools\anaconda3\lib\site-packages (from yellowbrick) (0.22)
Requirement already satisfied: scipy>=1.0.0 in c:\tools\anaconda3\lib\site-packages (from yellowbrick) (1.2.1)
Requirement already satisfied: cycler>=0.10.0 in c:\tools\anaconda3\lib\site-packages (from yellowbrick) (0.10.0)
Requirement already satisfied: numpy>=1.13.0 in c:\tools\anaconda3\lib\site-packages (from yellowbrick) (1.18.1)
Requirement already satisfied: matplotlib!=3.0.0,>=2.0.2 in c:\tools\anaconda3\lib\site-packages (from yellowbrick) (3.0.3)
Requirement already satisfied: joblib>=0.11 in c:\tools\anaconda3\lib\site-packages (from scikit-learn>=0.20->yellowbrick) (0.14.0)
Requirement already satisfied: six in c:\tools\anaconda3\lib\site-packages (from cycler>=0.10.0->yellowbrick) (1.14.0)
Requirement already satisfied: kiwisolver>=1.0.1 in c:\tools\anaconda3\lib\site-packages (from matplotlib!=3.0.0,>=2.0.2->yellowbrick) (1.1.0)
Requirement already satisfied: pyparsing!=2.0.4,!=2.1.2,!=2.1.6,>=2.0.1 in c:\tools\anaconda3\lib\site-packages (from matplotlib!=3.0.0,>=2.0.2->yellowbrick) (2.4.6)
Requirement already satisfied: python-dateutil>=2.1 in c:\tools\anaconda3\lib\site-packages (from matplotlib!=3.0.0,>=2.0.2->yellowbrick) (2.8.1)
Requirement already satisfied: setuptools in c:\tools\anaconda3\lib\site-packages (from kiwisolver>=1.0.1->matplotlib!=3.0.0,>=2.0.2->yellowbrick) (45.2.0.post20200210)
Requirement already satisfied: PyQt5 in c:\tools\anaconda3\lib\site-packages (5.14.2)
Requirement already satisfied: PyQt5-sip<13,>=12.7 in c:\tools\anaconda3\lib\site-packages (from PyQt5) (12.7.2)
In [2]:
import pandas as pd
import os
import matplotlib
import matplotlib.pyplot as plt
import warnings
import seaborn as sns
import numpy as np

warnings.filterwarnings('ignore')
matplotlib.style.use('seaborn')

pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
pd.set_option('display.expand_frame_repr', False)
pd.set_option('max_colwidth', -1)

Load dataframes

In [3]:
os.listdir('./Data')
Out[3]:
['2d-10c.dat',
 'iris.csv',
 'long.data',
 'order2-3clust.csv',
 'smile.csv',
 'square.data']
In [4]:
df_2d10c = pd.read_table('./Data/2d-10c.dat', names=['Feature1', 'Feature2', 'ClusterIndex'], sep=r'\s+')
df_2d10c.head()
Out[4]:
Feature1 Feature2 ClusterIndex
0 12.757900 -4.81962 0
1 -0.298175 -5.03868 0
2 9.155580 -3.81778 0
3 4.298230 -6.25105 0
4 8.833200 -3.48504 0
In [5]:
df_iris = pd.read_csv('./Data/iris.csv', names = ['SepalLengthCm','SepalWidthCm','PetalLengthCm',
                                                  'PetalWidthCm', 'SpeciesIndex'])
df_iris.head()
Out[5]:
SepalLengthCm SepalWidthCm PetalLengthCm PetalWidthCm SpeciesIndex
0 5.1 3.5 1.4 0.2 0
1 4.9 3.0 1.4 0.2 0
2 4.7 3.2 1.3 0.2 0
3 4.6 3.1 1.5 0.2 0
4 5.0 3.6 1.4 0.2 0
In [6]:
df_long = pd.read_table('./Data/long.data', names=['Feature1', 'Feature2', 'ClusterIndex'], sep=r'\s+')
df_long.head()
Out[6]:
Feature1 Feature2 ClusterIndex
0 0.167967 -0.171240 0
1 -1.257550 0.054219 0
2 1.648140 0.027165 0
3 -0.124180 0.038627 0
4 2.458720 -0.039070 0
In [7]:
df_order2_3c = pd.read_table('./Data/order2-3clust.csv', names=['Feature1', 'Feature2', 'ClusterIndex'], sep=',')
df_order2_3c.head()
Out[7]:
Feature1 Feature2 ClusterIndex
0 -8.804895 104.458571 0
1 -1.409040 37.172486 0
2 7.336955 82.018129 0
3 5.751823 61.828626 0
4 -5.806146 57.801112 0
In [8]:
df_smile = pd.read_table('./Data/smile.csv', names=['Feature1', 'Feature2', 'ClusterIndex'], sep=',')
df_smile.head()
Out[8]:
Feature1 Feature2 ClusterIndex
0 -0.674245 -0.664492 0
1 0.846659 -0.538960 0
2 0.536777 -0.835193 0
3 -0.412617 -1.025037 0
4 -0.776213 -0.494642 0
In [9]:
df_square = pd.read_table('./Data/square.data', names=['Feature1', 'Feature2', 'ClusterIndex'], sep=r'\s+')
df_square.head()
Out[9]:
Feature1 Feature2 ClusterIndex
0 6.10273 6.28359 0
1 5.46167 8.29113 0
2 6.50055 4.81082 0
3 5.00059 9.02367 0
4 3.66494 11.85230 0

Evaluation metrics

For the validation part of the clusterization, the Adjusted Rand Index is going to be used as an evaluation metric. Given knowledge of the ground-truth class assignments labels_true and our clustering algorithm's assignments of the same samples labels_pred, the adjusted Rand index is a function that measures the similarity of the two assignments, ignoring permutations and with chance normalization.

In [10]:
from sklearn import metrics

labels_true = [0, 0, 0, 1, 1, 1]
labels_pred = [0, 0, 1, 1, 2, 2]

metrics.adjusted_rand_score(labels_true, labels_pred)
Out[10]:
0.24242424242424246
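The two properties mentioned above can be checked directly: relabeling the predicted clusters leaves the score unchanged, and the index is symmetric in its two arguments. A small sketch (the relabeled vector is just the prediction above with its cluster ids permuted):

```python
from sklearn import metrics

labels_true = [0, 0, 0, 1, 1, 1]
labels_pred = [0, 0, 1, 1, 2, 2]

# The same partition as labels_pred, with cluster ids permuted (0->2, 1->0, 2->1)
relabeled_pred = [2, 2, 0, 0, 1, 1]

score = metrics.adjusted_rand_score(labels_true, labels_pred)

# Permutation invariance: identical partitions give identical scores
assert abs(score - metrics.adjusted_rand_score(labels_true, relabeled_pred)) < 1e-12

# The index is also symmetric in its two arguments
assert abs(score - metrics.adjusted_rand_score(labels_pred, labels_true)) < 1e-12
```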

Clustering models implementation

Firstly, the implementations of the clusterization methods are established; after that, their performance is analyzed on each of the synthetic datasets enumerated previously.

Overview on the true clusterization

In [11]:
from mpl_toolkits.axes_grid1 import make_axes_locatable

def see_true_clusterization(feature_one, feature_two, true_labels, xlabel, ylabel, fig=None, ax=None, size=None):
    if size is None:
        size = 11
        
    if ax is None and fig is None:
        fig = plt.figure(figsize=(10,7))
        ax = fig.add_subplot(111)
    
    scatter = ax.scatter(feature_one, feature_two,
                         c=true_labels, cmap='gist_rainbow', alpha=0.6, s=size)
    ax.set_title('Overview on the true clusterization')
    ax.set_xlabel(xlabel)
    ax.set_ylabel(ylabel)
    
    divider = make_axes_locatable(ax)
    cax = divider.append_axes('right', size='5%', pad=0.05)
    fig.colorbar(scatter, cax=cax, orientation='vertical')    

Hierarchical clustering

In [12]:
from sklearn.cluster import AgglomerativeClustering
from scipy.cluster.hierarchy import dendrogram, linkage
    
def hierarchycal_clustering(no_of_clusters, condensed_data_matrix, intracluster_metric):
    cluster = AgglomerativeClustering(n_clusters=no_of_clusters, 
                                      affinity='euclidean', linkage=intracluster_metric)
    cluster.fit_predict(condensed_data_matrix)
    return cluster.labels_
In [13]:
def visualize_clusterization(feature_one, feature_two, true_labels, xlabel, ylabel, 
                             predicted_labels, title, fig=None, ax=None, size=None):
    
    if ax is None and fig is None:
        fig = plt.figure(figsize=(10,7))
        ax = fig.add_subplot(111)
    
    adj_rand_index = round(metrics.adjusted_rand_score(true_labels, predicted_labels), 6)
    textstr = 'Adjusted Rand Index = ' + str(adj_rand_index)
    props = dict(boxstyle='round', facecolor='wheat', alpha=0.5)

    if size is None:
        size=11
        
    ax.text(0.05, 0.95, textstr, transform=ax.transAxes, fontsize=12,
            verticalalignment='top', bbox=props)
   
    scatter = ax.scatter(feature_one, feature_two,
                         c=predicted_labels, cmap='gist_rainbow', alpha=0.6, s=size)
    ax.set_title(title)
    ax.set_xlabel(xlabel)
    ax.set_ylabel(ylabel)
    
    divider = make_axes_locatable(ax)
    cax = divider.append_axes('right', size='5%', pad=0.05)
    fig.colorbar(scatter, cax=cax, orientation='vertical')
    
    return plt, adj_rand_index

Single Linkage (nearest neighbour)

Proximity between two clusters is the proximity between their two closest objects; this value is one of the values of the input matrix. The conceptual metaphor of this kind of cluster, its archetype, is a spectrum or chain. Chains can be straight or curvilinear, or can have a "snowflake" or "amoeba" appearance. The two most dissimilar cluster members can happen to be far more dissimilar than the two most similar ones: the single linkage method controls only nearest-neighbour similarity.
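The nearest-neighbour rule can be verified numerically with SciPy: once two groups have formed, the height at which single linkage merges them equals the minimum pairwise distance between the groups, while complete linkage uses the maximum. A small sketch on a hand-picked 4-point dataset:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import cdist

# Two tight pairs of points on a line
A = np.array([[0.0, 0.0], [1.0, 0.0]])
B = np.array([[4.0, 0.0], [4.5, 0.0]])
X = np.vstack([A, B])

cross = cdist(A, B)  # pairwise distances between the two groups

# Each pair merges first; the last merge joins group A with group B, and
# its height is the cluster-to-cluster proximity used by the linkage criterion
assert np.isclose(linkage(X, method='single')[-1, 2], cross.min())    # 3.0
assert np.isclose(linkage(X, method='complete')[-1, 2], cross.max())  # 4.5
```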

In [14]:
labels_pred = hierarchycal_clustering(max(list(df_2d10c['ClusterIndex'].unique())) + 1, 
                                      df_2d10c[['Feature1', 'Feature2']], 'single')

Average Linkage (Method of between-group average linkage (UPGMA))

Proximity between two clusters is the arithmetic mean of all the proximities between the objects of one cluster and the objects of the other. The metaphor of this kind of cluster is quite generic: a united class or close-knit collective; the method is frequently set as the default in hierarchical clustering packages. Clusters of miscellaneous shapes and outlines can be produced.

In [15]:
labels_pred = hierarchycal_clustering(max(list(df_2d10c['ClusterIndex'].unique())) + 1, 
                                      df_2d10c[['Feature1', 'Feature2']], 'average')

Complete Linkage (farthest neighbour)

Proximity between two clusters is the proximity between their two most distant objects; this value is one of the values of the input matrix. The metaphor of this kind of cluster is a circle (in the social sense, as in a circle formed around a hobby or plot), where the two members most distant from each other cannot be much more dissimilar than other quite dissimilar pairs. Such clusters have "compact" contours at their borders, but they are not necessarily compact inside.

In [16]:
labels_pred = hierarchycal_clustering(max(list(df_2d10c['ClusterIndex'].unique())) + 1, 
                                      df_2d10c[['Feature1', 'Feature2']], 'complete')

Ward variance hierarchical clustering (minimal increase of sum-of-squares (MISSQ))

Proximity between two clusters is the magnitude by which the summed square in their joint cluster exceeds the combined summed square within the two clusters: SS12 − (SS1 + SS2). (Between two singleton objects this quantity equals the squared Euclidean distance divided by 2.) The metaphor of this kind of cluster is a type: intuitively, a type is a cloud that is denser and more concentric towards its middle, whereas marginal points are few and can be scattered relatively freely.
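The parenthetical claim above, that for two singletons the increase in summed square equals half the squared Euclidean distance, is easy to verify numerically; a small sketch with hypothetical points:

```python
import numpy as np

def summed_square(points):
    # Sum of squared distances of the points to their centroid
    centroid = points.mean(axis=0)
    return ((points - centroid) ** 2).sum()

a = np.array([[1.0, 2.0]])
b = np.array([[4.0, 6.0]])

# Singleton clusters have SS = 0, so the increase is just SS of the joint cluster
ss_joint = summed_square(np.vstack([a, b]))
increase = ss_joint - (summed_square(a) + summed_square(b))

squared_euclidean = ((a - b) ** 2).sum()  # 3^2 + 4^2 = 25
assert np.isclose(increase, squared_euclidean / 2)  # 12.5
```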

In [17]:
labels_pred = hierarchycal_clustering(max(list(df_2d10c['ClusterIndex'].unique())) + 1, 
                                      df_2d10c[['Feature1', 'Feature2']], 'ward')

Density based clustering algorithms

DBSCAN

DBSCAN stands for Density-Based Spatial Clustering of Applications with Noise.

Coming to the algorithm, below are the steps in which DBSCAN works:

  1. Find the points in the eps neighborhood of every point, and identify the core points with more than minPoints neighbors.
  2. Find the connected components of core points on the neighbor graph, ignoring all non-connected points.
  3. Assign each non-core point (outlier) to a nearby cluster if that cluster is an eps neighbor; otherwise, assign it to noise.

DBSCAN is used to identify associations and relationships in data that are hard to find manually but that can be relevant, and it is thus used to find patterns and predict trends in the data.
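A minimal sketch of the three steps above on toy data (the eps and min_samples values here are illustrative, chosen for this toy data, not the ones used in the runs below):

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Two dense blobs plus one isolated point
X = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
              [5.0, 5.0], [5.1, 5.0], [5.0, 5.1],
              [20.0, 20.0]])

labels = DBSCAN(eps=0.5, min_samples=3).fit(X).labels_

# The two blobs become clusters 0 and 1; the isolated point has fewer
# than min_samples neighbors within eps, so it receives the noise label -1
assert set(labels) == {0, 1, -1}
assert labels[-1] == -1
```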

In [18]:
from sklearn.cluster import DBSCAN

def perform_dbscan(data, dbscan_min_samples):
    clustering = DBSCAN(min_samples=dbscan_min_samples).fit(data)
    return clustering.labels_

K-means

K-means clustering is a method of vector quantization, originally from signal processing, that aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean (cluster centers or cluster centroid), serving as a prototype of the cluster. This results in a partitioning of the data space into Voronoi cells. It is popular for cluster analysis in data mining.
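The "nearest mean" property can be checked against scikit-learn's result: after convergence, recomputing the assignment by hand from the fitted centroids reproduces the labels. A small sketch on synthetic blobs (not one of the datasets above):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.RandomState(0)
X = np.vstack([rng.randn(50, 2), rng.randn(50, 2) + [6, 6]])

model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# Distance of every sample to every fitted centroid, shape (n_samples, k)
dists = np.linalg.norm(X[:, None, :] - model.cluster_centers_[None, :, :], axis=2)

# Each observation belongs to the cluster with the nearest mean
assert np.array_equal(dists.argmin(axis=1), model.labels_)
```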

In [19]:
from sklearn.cluster import KMeans

def perform_kmeans_known_n_clusters(n_clusters, data, feature1, feature2, true_labels, xlabel, 
                                    ylabel, title, fig=None, ax=None, size=None):
    
    kmeans_model = KMeans(n_clusters=n_clusters, init='k-means++', max_iter=100, random_state=0)
    labels_pred = kmeans_model.fit_predict(data)
    
    plot, kmeans_adj_rand_index = visualize_clusterization(feature1, feature2, true_labels, 
                                   xlabel, ylabel, labels_pred, title, fig, ax, size)

    # get centers for plot
    centers = kmeans_model.cluster_centers_
    ax.scatter(centers[:, 0], centers[:, 1], c='black', s=30, alpha=0.75)
    return kmeans_model, kmeans_adj_rand_index

Expectation Maximization (EM)

In statistics, an expectation–maximization (EM) algorithm is an iterative method to find maximum likelihood or maximum a posteriori (MAP) estimates of parameters in statistical models, where the model depends on unobserved latent variables. The EM iteration alternates between performing an expectation (E) step, which creates a function for the expectation of the log-likelihood evaluated using the current estimate for the parameters, and a maximization (M) step, which computes parameters maximizing the expected log-likelihood found on the E step. These parameter-estimates are then used to determine the distribution of the latent variables in the next E step.
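In the clustering setting, the E step's output can be inspected directly through GaussianMixture.predict_proba, which returns the per-component responsibilities; a small sketch on synthetic 1-D data:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.RandomState(0)
X = np.vstack([rng.randn(100, 1), rng.randn(100, 1) + 5])

gmm = GaussianMixture(n_components=2, random_state=0).fit(X)

# The E step produces "responsibilities": a soft assignment of each
# sample to each component; every row sums to one
resp = gmm.predict_proba(X)
assert np.allclose(resp.sum(axis=1), 1.0)

# The hard labels used for clustering are the argmax of the soft ones
assert np.array_equal(resp.argmax(axis=1), gmm.predict(X))
```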

In [20]:
from sklearn.mixture import GaussianMixture

def perform_gmm(n_clusters, data):
    gmm = GaussianMixture(n_components=n_clusters, random_state=0)
    gmm.fit(data)
    
    labels_pred = gmm.predict(data)
    return labels_pred

Self-organizing maps (SOM)

A self-organizing map (SOM) is a type of artificial neural network (ANN) that is trained using unsupervised learning to produce a low-dimensional (typically two-dimensional), discretized representation of the input space of the training samples, called a map, and is therefore a method to do dimensionality reduction. Self-organizing maps differ from other artificial neural networks as they apply competitive learning as opposed to error-correction learning (such as backpropagation with gradient descent), and in the sense that they use a neighborhood function to preserve the topological properties of the input space.

The algorithm:

  1. Each node’s weights are initialized.
  2. A vector is chosen at random from the set of training data.
  3. Every node is examined to calculate which one’s weights are most like the input vector. The winning node is commonly known as the Best Matching Unit (BMU).
  4. Then the neighbourhood of the BMU is calculated. The number of neighbours decreases over time.
  5. The winning node's weights are rewarded by becoming more like the sample vector. The neighbours also become more like the sample vector. The closer a node is to the BMU, the more its weights get altered; the farther away a neighbour is from the BMU, the less it learns.
  6. Repeat step 2 for N iterations.
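The steps above can be sketched with plain NumPy for one training iteration on a tiny 1 x 3 map (the learning rate and sigma values are illustrative):

```python
import numpy as np

rng = np.random.RandomState(0)

# Step 1: initialize the weights of a 1 x 3 map of 2-dimensional nodes
weights = rng.rand(3, 2)

# Step 2: pick a training sample
sample = np.array([0.9, 0.1])

# Step 3: the Best Matching Unit is the node whose weights are closest
bmu = np.argmin(np.linalg.norm(weights - sample, axis=1))

# Steps 4-5: move every node towards the sample, scaled by a Gaussian
# neighbourhood of the BMU on the map grid
learning_rate, sigma = 0.5, 1.0
grid_dist = np.abs(np.arange(3) - bmu)
influence = np.exp(-grid_dist ** 2 / (2 * sigma ** 2))
old_dist = np.linalg.norm(weights[bmu] - sample)
weights += learning_rate * influence[:, None] * (sample - weights)

# The BMU ends up closer to the sample than before the update
assert np.linalg.norm(weights[bmu] - sample) < old_dist
```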
In [21]:
from minisom import MiniSom    

def perform_som_clusterization(n_clusters, data, feature1, feature2, true_labels, xlabel, 
                               ylabel, title, size=None):
    # data normalization
    data = data -  np.mean(data, axis=0)
    data /= np.std(data)

    # Initialization and training
    som_shape = (1, n_clusters)
    som = MiniSom(som_shape[0], som_shape[1], 2, sigma=.3, learning_rate=.5,
                  neighborhood_function='gaussian', random_seed=0)

    som.pca_weights_init(data)
    som.train_batch(data, 1000, verbose=False)

    # each neuron represents a cluster
    winner_coordinates = np.array([som.winner(x) for x in data]).T

    cluster_index = np.ravel_multi_index(winner_coordinates, som_shape)

    fig = plt.figure(figsize=(7,9))
    ax = fig.add_subplot(111)
        
    adj_rand_index = round(metrics.adjusted_rand_score(true_labels, cluster_index), 6)
    textstr = 'Adjusted Rand Index = ' + str(adj_rand_index)
    props = dict(boxstyle='round', facecolor='wheat', alpha=0.5)
    ax.text(0.05, 0.95, textstr, transform=ax.transAxes, fontsize=11,
            verticalalignment='top', bbox=props)
    
    # plotting the clusters using the first 2 dimensions of the data
    if size is None:
        size = 11

    for c in np.unique(cluster_index):
        ax.scatter(data[cluster_index == c, 0],
                    data[cluster_index == c,1], label='cluster='+str(c), alpha=.7, s=size)

    # plotting centroids
    for centroid in som.get_weights():
        ax.scatter(centroid[:, 0], centroid[:, 1], marker='x', 
                    s=80, linewidths=35, color='k', label='centroid')
    
    plt.xlabel(xlabel)
    plt.ylabel(ylabel)
    plt.title(title)
    ax.legend(loc='upper center', bbox_to_anchor=(1.13, 1.0), shadow=True, ncol=1)
    
    return adj_rand_index

Visualization of each clusterization model

In [22]:
def perform_all_clusterization_types(dataframe, name_for_feature1, name_for_feature2, true_labels_column, 
                                     dbscan_min_samples, size=None):
    
    # config the parameters for the clusterizations
    true_labels = dataframe[true_labels_column]
    feature1 = dataframe[name_for_feature1]
    feature2 = dataframe[name_for_feature2]
    data = dataframe[[name_for_feature1, name_for_feature2]]
    n_clusters = max(list(true_labels.unique())) + 1
    params_clustering_model = [n_clusters, data.values]
    params_scatter_plot = [feature1, feature2, true_labels, name_for_feature1, name_for_feature2]

    # overview on the real clusterization
    fig, axs = plt.subplots(1, 2, figsize=(20, 10))
    see_true_clusterization(feature1, feature2, true_labels, name_for_feature1, name_for_feature2, fig, axs[0], size)


    # single linkage
    labels_pred = hierarchycal_clustering(*params_clustering_model, 'single')
    plot, single_linkage_adj_rand_index = visualize_clusterization(*params_scatter_plot, labels_pred, 
                                                                   'Clusterization using single linkage', fig, axs[1], size)

    # average linkage
    fig, axs = plt.subplots(1, 2, figsize=(20, 10))
    labels_pred = hierarchycal_clustering(*params_clustering_model, 'average')
    plot, average_linkage_adj_rand_index = visualize_clusterization(*params_scatter_plot, labels_pred,
                                                                    'Clusterization using average linkage', fig, axs[0], size)

    # complete linkage
    labels_pred = hierarchycal_clustering(*params_clustering_model, 'complete')
    plot, complete_linkage_adj_rand_index = visualize_clusterization(*params_scatter_plot, labels_pred, 
                                                                     'Clusterization using complete linkage', fig, axs[1], size)

    # ward variance hierarchical clustering
    fig, axs = plt.subplots(1, 2, figsize=(20, 10))
    labels_pred = hierarchycal_clustering(*params_clustering_model, 'ward')
    plot, ward_variance_adj_rand_index = visualize_clusterization(*params_scatter_plot, labels_pred, 
                                                                  'Clusterization using ward variance', fig, axs[0], size)

    # density clustering: DBSCAN
    labels_pred = perform_dbscan(data, dbscan_min_samples)
    plot, dbscan_adj_rand_index = visualize_clusterization(*params_scatter_plot, labels_pred, 
                                                           'Clusterization using DBSCAN', fig, axs[1], size)

    # k-means with given k-partitions number
    fig, axs = plt.subplots(1, 2, figsize=(20, 10))
    model_kmeans, kmeans_adj_rand_index = perform_kmeans_known_n_clusters(*params_clustering_model, 
                                                                          *params_scatter_plot, 
                                                                          'Clusterization using K-means', fig, axs[0], size)

    # expectation maximization: soft clustering
    labels_pred = perform_gmm(*params_clustering_model)
    plot, em_adj_rand_index = visualize_clusterization(*params_scatter_plot, labels_pred, 'Clusterization using EM', 
                                                       fig, axs[1], size)

    # self-organizing maps
    som_adj_rand_index = perform_som_clusterization(*params_clustering_model, 
                                                    *params_scatter_plot, 
                                                    'SOM clusterization', size)
    
    return [single_linkage_adj_rand_index, average_linkage_adj_rand_index,
            complete_linkage_adj_rand_index, ward_variance_adj_rand_index,
            dbscan_adj_rand_index, kmeans_adj_rand_index, em_adj_rand_index,
            som_adj_rand_index]

Dataframe 2d-10c.dat

Visualization of each clusterization model

In [23]:
clusterization_results = perform_all_clusterization_types(df_2d10c, 'Feature1', 'Feature2', 'ClusterIndex', 0, 7)

Results centralization

In [24]:
results_df = pd.DataFrame({'Clustering_algorithm': ['Single linkage', 'Average linkage', 'Complete linkage', 'Ward variance',
                                                    'DBSCAN', 'K-means', 'EM', 'SOM'],
                           'Result': clusterization_results}).sort_values(by=['Result'], ascending=False)
results_df
Out[24]:
Clustering_algorithm Result
6 EM 0.998209
4 DBSCAN 0.899050
5 K-means 0.723577
1 Average linkage 0.661878
3 Ward variance 0.661219
7 SOM 0.560554
2 Complete linkage 0.517892
0 Single linkage 0.304839

Performance analysis

From the above results, it is obvious that EM manages to cluster the data almost perfectly, since the isocontour curves are likely to be curves of Gaussian distributions. More specifically, the entire dataset seems to be formed of samples generated by a Gaussian mixture distribution. Along the same lines, DBSCAN performs quite well, since the elongated clusters are dense and the number of outliers is quite small.

K-means and the hierarchical clustering based on Ward variance perform almost the same: they manage to cluster the globular clusters correctly, but most of the elongated clusters are clustered incorrectly, mixing half of one elongated cluster with half of another.

Single linkage discerns 3 main clusters, since the elongated clusters are very close to each other, and because of this some outliers form isolated clusters of cardinality one.

Since the clusters are not very spherical, SOM performs much worse than K-means.

Dataframe iris.csv

In [25]:
df_iris.head()
Out[25]:
SepalLengthCm SepalWidthCm PetalLengthCm PetalWidthCm SpeciesIndex
0 5.1 3.5 1.4 0.2 0
1 4.9 3.0 1.4 0.2 0
2 4.7 3.2 1.3 0.2 0
3 4.6 3.1 1.5 0.2 0
4 5.0 3.6 1.4 0.2 0
In [26]:
corrMatrix = df_iris.corr()
sns.heatmap(corrMatrix, annot=True)
plt.show()

I will plot the data in a 3D format just to get an overview of how the data is clustered in $\mathbb{R}^3$. The correlation matrix will be useful when plotting the clustering results in a 2D format (first on 2 features, then on the other 2). The combinations of 2 features will be chosen so that the paired features are correlated, if possible.

In [27]:
from mpl_toolkits import mplot3d
import plotly.express as px

fig = plt.figure(figsize=(15,15))
fig = px.scatter_3d(df_iris, 
                    x='SepalLengthCm',
                    y='SepalWidthCm', 
                    z='PetalLengthCm',
                    color='SpeciesIndex', 
                    opacity=0.9, 
                    size='PetalWidthCm')
fig.show()
<Figure size 1080x1080 with 0 Axes>

Visualization of each clusterization model

In [28]:
clusterization_results = perform_all_clusterization_types(df_iris, 'SepalWidthCm', 'PetalWidthCm', 'SpeciesIndex', 1, 70)
In [29]:
clusterization_results = perform_all_clusterization_types(df_iris, 'SepalLengthCm', 'PetalLengthCm', 'SpeciesIndex', 1, 70)

Results centralization

In [30]:
results_df = pd.DataFrame({'Clustering_algorithm': ['Single linkage', 'Average linkage', 'Complete linkage', 'Ward variance',
                                                    'DBSCAN', 'K-means', 'EM', 'SOM'],
                           'Result': clusterization_results}).sort_values(by=['Result'], ascending=False)
results_df
Out[30]:
Clustering_algorithm Result
6 EM 0.772004
2 Complete linkage 0.732298
7 SOM 0.699237
5 K-means 0.698863
3 Ward variance 0.642251
4 DBSCAN 0.568116
1 Average linkage 0.565886
0 Single linkage 0.565747

Performance analysis

As seen in the above table, EM, K-means and SOM obtained good results due to the spherical nature of all three clusters (the best separation can be observed in the projection onto the PetalLength and SepalLength coordinates).

Since 2 of the clusters are very close to each other, single linkage groups them into a single cluster; the same goes for the DBSCAN algorithm, for which the density of the clusters and the lack of outliers emphasize this merging.

The other hierarchical clustering methods perform similarly, with some differences caused by the different intracluster metrics.

Dataframe long.data

Visualization of each clusterization model

In [31]:
clusterization_results = perform_all_clusterization_types(df_long, 'Feature1', 'Feature2', 'ClusterIndex', 4, 30)

Results centralization

In [32]:
results_df = pd.DataFrame({'Clustering_algorithm': ['Single linkage', 'Average linkage', 'Complete linkage', 'Ward variance',
                                                    'DBSCAN', 'K-means', 'EM', 'SOM'],
                           'Result': clusterization_results}).sort_values(by=['Result'], ascending=False)
results_df
Out[32]:
Clustering_algorithm Result
6 EM 0.015402
2 Complete linkage 0.009077
3 Ward variance 0.000155
1 Average linkage 0.000040
7 SOM 0.000025
0 Single linkage 0.000000
4 DBSCAN 0.000000
5 K-means -0.000805

Performance analysis

Even though it is not very obvious from the clusterization plots, it must be noted that both clusters are very elongated (on the Oy axis the maximum distance between two points is 2, while on the Ox axis it is 8). The clusters are very close on the Oy axis, and they are parallel.

These are the reasons why K-means, SOM, EM and complete linkage perform very poorly, mixing one half of a real cluster with one half of the other; this is why the performance result reaches an almost null value.

Regarding DBSCAN and single linkage, since the two clusters are close, they are grouped into a single cluster. The number of outliers is really small, so DBSCAN places only one or two samples into isolated clusters.

Dataframe order2-3clust.csv

Visualization of each clusterization model

In [33]:
clusterization_results = perform_all_clusterization_types(df_order2_3c, 'Feature1', 'Feature2', 'ClusterIndex', 2, 40)

Results centralization

In [34]:
results_df = pd.DataFrame({'Clustering_algorithm': ['Single linkage', 'Average linkage', 'Complete linkage', 'Ward variance',
                                                    'DBSCAN', 'K-means', 'EM', 'SOM'],
                           'Result': clusterization_results}).sort_values(by=['Result'], ascending=False)
results_df
Out[34]:
Clustering_algorithm Result
6 EM 0.380882
3 Ward variance 0.312760
7 SOM 0.301675
2 Complete linkage 0.292457
5 K-means 0.242237
1 Average linkage 0.179046
4 DBSCAN 0.112852
0 Single linkage 0.003978

Performance analysis

The data is formed of two intertwined parabolas, and one globular cluster in the middle of the second parabola.

Even though one would expect a single linkage model to perform best, due to the chaining nature of the points on the parabolas, the clusters are so close that it groups all of them into a single one.

DBSCAN manages to correctly cluster only the spherical cluster, even though some points placed on the middle parabola are marked as belonging to it. A lot of noise and outlier points can be noticed, due to the big differences between the Oy coordinates compared to the much smaller ones on the Ox axis.

The rest of the algorithms group the samples similarly, though incorrectly, due to the proximity of the inner clusters.

Dataframe smile.csv

Visualization of each clusterization model

In [35]:
clusterization_results = perform_all_clusterization_types(df_smile, 'Feature1', 'Feature2', 'ClusterIndex', 1, 40)

Results centralization

In [36]:
results_df = pd.DataFrame({'Clustering_algorithm': ['Single linkage', 'Average linkage', 'Complete linkage', 'Ward variance',
                                                    'DBSCAN', 'K-means', 'EM', 'SOM'],
                           'Result': clusterization_results}).sort_values(by=['Result'], ascending=False)
results_df
Out[36]:
Clustering_algorithm Result
0 Single linkage 1.000000
4 DBSCAN 1.000000
6 EM 0.266229
1 Average linkage 0.214779
7 SOM 0.212223
3 Ward variance 0.204147
2 Complete linkage 0.190593
5 K-means 0.185185

Performance analysis

Because the samples in the data are generated symmetrically in a bidimensional space, and the clusters are concentrated and isolated from each other, single linkage clusters the data perfectly. The same goes for the DBSCAN model.

If the radius of the circle had been bigger, K-means, Ward variance or SOM would have clustered the spherical clusters (the eyes of the face) correctly. Since the radius is approximately 2, and the eyes and the mouth lie close to the border of the face, the clusterization comes out as if the figure had been split into 4 squares with the centroids in their middles; this is why the rest of the methods perform similarly, with only a few differences.

Dataframe square.data

Visualization of each clusterization model

In [37]:
clusterization_results = perform_all_clusterization_types(df_square, 'Feature1', 'Feature2', 'ClusterIndex', 4, 30)

Results centralization

In [38]:
results_df = pd.DataFrame({'Clustering_algorithm': ['Single linkage', 'Average linkage', 'Complete linkage', 'Ward variance',
                                                    'DBSCAN', 'K-means', 'EM', 'SOM'],
                           'Result': clusterization_results}).sort_values(by=['Result'], ascending=False)
results_df
Out[38]:
Clustering_algorithm Result
5 K-means 0.693180
6 EM 0.675488
1 Average linkage 0.603744
2 Complete linkage 0.550484
3 Ward variance 0.523164
7 SOM 0.344892
4 DBSCAN 0.001279
0 Single linkage 0.000001

Performance analysis

DBSCAN and single linkage perform similarly, grouping all 4 clusters into a single one, with the mention that DBSCAN also places a lot of outliers into isolated clusters. This happens due to the concentration of the samples, distributed among 4 square-shaped clusters whose borders overlap.

K-means obtained the best results, along with EM, which captured the curves of some normal distributions in the data. The rest of the methods perform similarly.

Finding the inner number of clusters

Using the Elbow method and the Silhouette Width, I am going to identify the correct number of clusters for K-means and for a hierarchical clusterization algorithm. The dataset used will be df_2d10c.

In [39]:
# config the parameters for the clusterizations
dataframe = df_2d10c
name_for_feature1 = 'Feature1'
name_for_feature2 = 'Feature2'
true_labels_column = 'ClusterIndex'
true_labels = dataframe[true_labels_column]
feature1 = dataframe[name_for_feature1]
feature2 = dataframe[name_for_feature2]
data = dataframe[[name_for_feature1, name_for_feature2]]

K-means

Elbow method

The Elbow method is a very popular technique: the idea is to run K-means clustering for a range of numbers of clusters k (say, from 1 to 20) and, for each value, to calculate the sum of squared distances from each point to its assigned center (the distortion).

When these values are plotted and the plot looks like an arm, the "elbow" (the point of inflection of the curve) is the best value of k.

In [40]:
import tqdm

inertias = [] # the mean squared distances between each instance and its closest centroid

K = range(1,20)
for k in tqdm.tqdm(K):
    kmeanModel = KMeans(n_clusters=k)
    kmeanModel.fit(data)
    inertias.append(kmeanModel.inertia_) 
100%|██████████████████████████████████████████████████████████████████████████████████| 19/19 [00:03<00:00,  4.89it/s]
In [41]:
matplotlib.style.use('seaborn-whitegrid') 
fig = plt.figure(figsize=(16,4))
ax = fig.add_subplot(111)

plt.plot(K, inertias, 'bx-')
plt.xlabel('k')
plt.ylabel('Inertia')
plt.title('The Elbow Method showing the optimal k')

ax.axvline(4, color='red', linestyle='dashed', linewidth=1)
plt.show()

As noticed in the above diagram, the curve looks like an arm and the number of clusters should be chosen as 4, since after that point the curve reaches a plateau. I am going to verify this with an automatic visualization method.

In [42]:
from yellowbrick.cluster import KElbowVisualizer
matplotlib.style.use('seaborn-whitegrid') 

model = KMeans()
visualizer = KElbowVisualizer(model, k=(1,20))
visualizer.fit(data)        
visualizer.show()   
Out[42]:
<matplotlib.axes._subplots.AxesSubplot at 0x20394ebaeb8>
In [43]:
matplotlib.style.use('seaborn') 

n_clusters = 4
params_clustering_model = [n_clusters, data.values]
params_scatter_plot = [feature1, feature2, true_labels, name_for_feature1, name_for_feature2]

# overview on the real clusterization
fig, axs = plt.subplots(1, 2, figsize=(20, 10))
see_true_clusterization(*params_scatter_plot, fig, axs[0], 15)

# k-means with given k-partitions number
model_kmeans, kmeans_adj_rand_index = perform_kmeans_known_n_clusters(*params_clustering_model, 
                                                                      *params_scatter_plot, 
                                                                      'Clusterization using K-means with optimal K', 
                                                                      fig, axs[1], 20)

Silhouette Width

Silhouette analysis can be used to study the separation distance between the resulting clusters. The coefficient varies between -1 and 1. A value close to 1 implies that the instance is close to the other members of its cluster, i.e. it was assigned to the right cluster, whereas a value close to -1 means that the instance was assigned to the wrong cluster.

Silhouette Coefficient = $\frac{x-y}{max(x,y)}$, where y is the mean intra-cluster distance (the mean distance to the other instances in the same cluster) and x is the mean nearest-cluster distance (the mean distance to the instances of the next closest cluster).
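As a sanity check, here is a minimal sketch (on a hypothetical toy dataset, not df_2d10c) that computes the coefficient by hand for one instance and compares it with sklearn's per-instance `silhouette_samples`:

```python
import numpy as np
from sklearn.metrics import silhouette_samples

# Toy 1-D dataset: two well-separated pairs of points
X = np.array([[0.0], [0.5], [5.0], [5.5]])
labels = np.array([0, 0, 1, 1])

# Manual coefficient for the first point (index 0):
# y = mean intra-cluster distance, x = mean distance to the nearest other cluster
y = np.mean([abs(0.0 - 0.5)])                   # distance to the other member of cluster 0
x = np.mean([abs(0.0 - 5.0), abs(0.0 - 5.5)])   # distances to the members of cluster 1
manual_s = (x - y) / max(x, y)

# sklearn computes the same per-instance coefficient
sklearn_s = silhouette_samples(X, labels)[0]
print(round(manual_s, 6), round(sklearn_s, 6))
```

Both values agree, since `silhouette_samples` applies exactly the formula above to every instance.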

In [44]:
from sklearn.metrics import silhouette_score

silhouette_scores = []

K = range(2,20)
for k in tqdm.tqdm(K):
    kmeanModel = KMeans(n_clusters=k)
    labels_pred = kmeanModel.fit_predict(data)
    silhouette_scores.append(silhouette_score(data.values, labels_pred)) 
100%|██████████████████████████████████████████████████████████████████████████████████| 18/18 [00:05<00:00,  3.10it/s]
In [45]:
matplotlib.style.use('seaborn-whitegrid') 
fig = plt.figure(figsize=(20,4))
ax = fig.add_subplot(111)

plt.plot(K, silhouette_scores, 'bx-')
plt.xlabel('k')
plt.ylabel('Silhouette Scores')
plt.title('Silhouette width scores distribution among given K')

ax.axvline(np.argmax(silhouette_scores) + 2, color='red', linestyle='dashed', linewidth=1)
plt.show()

As per this method, k=3 was a local optimum, whereas k=15 should be chosen as the number of clusters. This method is better than the Elbow method, as it makes the decision regarding the optimal number of clusters more meaningful and clear. But this metric is computationally expensive, as the coefficient is calculated for every instance.
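The cost can be reduced by scoring only a random subsample: `silhouette_score` accepts a `sample_size` parameter for exactly this purpose. A small sketch on synthetic blobs (an illustrative dataset, not the one used above):

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=5000, centers=4, random_state=42)
labels = KMeans(n_clusters=4, random_state=42).fit_predict(X)

# Full score: requires all pairwise distances, O(n^2)
full = silhouette_score(X, labels)
# Approximate score computed on a random subsample of 500 instances
approx = silhouette_score(X, labels, sample_size=500, random_state=42)
print(round(full, 3), round(approx, 3))
```

The subsampled score is close to the full one while being far cheaper, which matters when scanning a whole range of k values.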

In [46]:
n_clusters = np.argmax(silhouette_scores) + 2
params_clustering_model = [n_clusters, data.values]
params_scatter_plot = [feature1, feature2, true_labels, name_for_feature1, name_for_feature2]

# overview on the real clusterization
matplotlib.style.use('seaborn') 
fig, axs = plt.subplots(1, 2, figsize=(20, 10))
see_true_clusterization(*params_scatter_plot, fig, axs[0], 15)

# k-means with given k-partitions number
model_kmeans, kmeans_adj_rand_index = perform_kmeans_known_n_clusters(*params_clustering_model, 
                                                                      *params_scatter_plot, 
                                                                      'Clusterization using K-means with optimal K', 
                                                                      fig, axs[1], 20)

Single linkage

In [47]:
def perform_hierarchical_clusteriz_after_selection(ax, fig, optimal_n_index, size, ns_list):
    n_clusters = ns_list[optimal_n_index]
    params_clustering_model = [n_clusters, data.values]
    labels_pred = hierarchycal_clustering(*params_clustering_model, intracluster_metrics[optimal_n_index])
    plot, adj_rand_index = visualize_clusterization(*params_scatter_plot, labels_pred, 
                                                                   'Clusterization using ' + 
                                                                   intracluster_metrics[optimal_n_index] + 
                                                                   ' hierarchical clusterization', 
                                                                   fig, ax, size)
In [48]:
def visualize_clusterization_after_selection(optimal_ns):
    # overview on the real clusterization
    fig, axs = plt.subplots(1, 2, figsize=(20, 10))
    see_true_clusterization(*params_scatter_plot, fig, axs[0], 15)

    for i in range(len(optimal_ns)):
        if i % 2 == 1:
            if i < len(optimal_ns) - 1:
                fig, axs = plt.subplots(1, 2, figsize=(20, 10))
                perform_hierarchical_clusteriz_after_selection(axs[0], fig, i, 15, optimal_ns)
            else:
                fig, axs = plt.subplots(1, 1, figsize=(7.5, 9))
                perform_hierarchical_clusteriz_after_selection(axs, fig, i, 13, optimal_ns)
        else:
            perform_hierarchical_clusteriz_after_selection(axs[1], fig, i, 15, optimal_ns)
            
intracluster_metrics = ['single', 'average', 'complete', 'ward']

Elbow method

In [49]:
def perform_elbow_hierarchical(intracluster_metric, ax):
    model = AgglomerativeClustering(affinity='euclidean', linkage=intracluster_metric)
    visualizer = KElbowVisualizer(model, ax=ax, k=(1,12), title='Distortion score Elbow for ' +
                                  intracluster_metric + ' hierarchical clusterization')
    visualizer.fit(data)        
    visualizer.finalize()   

matplotlib.style.use('seaborn-whitegrid') 

fig, axs = plt.subplots(1, 2, figsize=(17, 6))
perform_elbow_hierarchical(intracluster_metrics[0], axs[0])
perform_elbow_hierarchical(intracluster_metrics[1], axs[1])
plt.show()

fig, axs = plt.subplots(1, 2, figsize=(17, 6))
perform_elbow_hierarchical(intracluster_metrics[2], axs[0])
perform_elbow_hierarchical(intracluster_metrics[3], axs[1])
plt.show()    
In [50]:
optimal_ns_elbow = [5,5,2,3]
In [51]:
matplotlib.style.use('seaborn') 
visualize_clusterization_after_selection(optimal_ns_elbow)

Silhouette Width

In [52]:
def perform_silhouette_hierarchical_method(intracluster_metric):
    silhouette_scores = []

    no_of_clusters = range(2,20)
    for n in no_of_clusters:
        cluster = AgglomerativeClustering(n_clusters=n, affinity='euclidean', linkage=intracluster_metric)
        labels_pred = cluster.fit_predict(data)
        silhouette_scores.append(silhouette_score(data.values, labels_pred)) 

    fig = plt.figure(figsize=(20,4))
    ax = fig.add_subplot(111)

    plt.plot(no_of_clusters, silhouette_scores, 'bx-')
    plt.xlabel('n_clusters')
    plt.ylabel('Silhouette Scores')
    plt.title('Silhouette width scores distribution among given number of clusters for ' + 
              intracluster_metric + ' hierarchical clusterization')

    ax.axvline(np.argmax(silhouette_scores) + 2, color='red', linestyle='dashed', linewidth=1)
    plt.show()
    return np.argmax(silhouette_scores) + 2

matplotlib.style.use('seaborn-whitegrid') 
optimal_ns_silhouette = [perform_silhouette_hierarchical_method(metric) for metric in intracluster_metrics]
In [53]:
optimal_ns_silhouette
Out[53]:
[2, 15, 5, 15]
In [54]:
matplotlib.style.use('seaborn') 
visualize_clusterization_after_selection(optimal_ns_silhouette)

Since the dataframe 2d_10c is formed mostly of Gaussian distributions, hierarchical clustering won't manage to separate the groups properly, combining half of one true cluster with half of another, and so on. Ward clustering seems to perform better compared with the rest. Since the Gaussian curves are very close to one another, single linkage groups almost all of the samples into a single cluster, with only 2 outliers appearing isolated in another cluster.
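This chaining behaviour of single linkage can be reproduced on a small synthetic example (assuming scikit-learn; the blob centers below are arbitrary, chosen only to make the Gaussians overlap):

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.cluster import AgglomerativeClustering

# Two overlapping Gaussian blobs: their closest points form a "bridge"
X, _ = make_blobs(n_samples=300, centers=[[0, 0], [2, 0]],
                  cluster_std=1.0, random_state=0)

single = AgglomerativeClustering(n_clusters=2, linkage='single').fit_predict(X)
ward = AgglomerativeClustering(n_clusters=2, linkage='ward').fit_predict(X)

# Single linkage chains through the bridge, so it dumps nearly everything
# into one cluster and isolates a few outliers...
sizes_single = np.bincount(single)
# ...while Ward's variance criterion splits the data into comparable groups
sizes_ward = np.bincount(ward)
print(sorted(sizes_single), sorted(sizes_ward))
```

The smallest single-linkage cluster contains only a handful of outlying points, matching the behaviour observed on 2d_10c above.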